Build a Traffic Sign Recognition Project
Here are summary statistics of the traffic signs data sets:
As shown below, each dataset has a different class frequency distribution.
In the training dataset, sample counts differ by nearly a factor of ten across the 43 labels (classes).
Training on this dataset therefore may not give every label equal conditions.
Moreover, some labels may have too few samples to train adequately.
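The imbalance described above can be measured with a short sketch like the following. It assumes the labels are available as an integer array; the function names and the toy labels are hypothetical and are not taken from the project notebook.

```python
import numpy as np

def class_counts(labels, n_classes=43):
    """Return the number of samples for each class label."""
    return np.bincount(np.asarray(labels), minlength=n_classes)

def imbalance_ratio(counts):
    """Ratio between the largest and smallest non-empty class."""
    nonzero = counts[counts > 0]
    return nonzero.max() / nonzero.min()

# Toy labels for illustration only (not the real dataset).
y_toy = np.array([0, 0, 0, 0, 1, 2, 2])
counts = class_counts(y_toy, n_classes=3)
ratio = imbalance_ratio(counts)
```

Applied to the real training labels, a ratio near 10 would correspond to the tenfold difference noted above.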
The following shows, for each label, the mean of the per-image mean pixel values.
Labels 6, 20, 10 and 8 have very dark images, with mean pixel values below about 50.
The following shows, for each label, the standard deviation of the per-image mean pixel values.
Labels 6, 21, 27 and others have low deviation among the sample images of the same label.
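As a rough sketch of how these per-label statistics could be computed, the following assumes the images are stored as an array of shape (N, H, W, C) with matching integer labels. The function name is hypothetical.

```python
import numpy as np

def per_class_brightness(images, labels, n_classes=43):
    """Mean and stdev, per class, of each image's mean pixel value."""
    images = np.asarray(images, dtype=np.float64)
    labels = np.asarray(labels)
    # One scalar per image: the mean over all of its pixels and channels.
    image_means = images.reshape(len(images), -1).mean(axis=1)
    means = np.full(n_classes, np.nan)
    stds = np.full(n_classes, np.nan)
    for c in range(n_classes):
        values = image_means[labels == c]
        if values.size:
            means[c] = values.mean()
            stds[c] = values.std()
    return means, stds
```

Classes with no samples are left as NaN so that they are not mistaken for dark or low-variance classes.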
As explained above, the training dataset may cause training problems such as:
1. too few samples for some labels
2. low-contrast (dark) images
3. low variance within some labels
Here is a quick look at typical images in the training dataset.
The dataset contains many similar images that appear to have been augmented with image processing techniques such as changing brightness, contrast, chroma and crop position.
Here are example images for each class (the first image of the class in the training dataset) and the class average images.
All class average images still retain enough of their own characteristics to be recognized as traffic signs.
However, some classes appear to be problematic.
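A class average image is the pixel-wise mean over all images of one class. A minimal sketch, assuming an image array of shape (N, H, W, C) and integer labels (the function name is hypothetical):

```python
import numpy as np

def class_average_images(images, labels, n_classes=43):
    """Pixel-wise mean image for every class; empty classes stay all zero."""
    images = np.asarray(images, dtype=np.float64)
    labels = np.asarray(labels)
    averages = np.zeros((n_classes,) + images.shape[1:])
    for c in range(n_classes):
        mask = labels == c
        if mask.any():
            averages[c] = images[mask].mean(axis=0)
    return averages
```

For display, the result would typically be rounded and cast back to `uint8`.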
The training dataset has the potential problems described above.
To reduce this risk, I ran feasibility tests before selecting the pre-processing methods, the CNN design and the image augmentation methods.
First, I built a reasonably sized model for the feasibility tests.
This model is bigger than the LeNet-5 from lesson 8 and smaller than the final model, so I named it "middle model".
Here are the specifications of the middle model and its training parameters.
| Layer | Description |
|---|---|
| Input | 32x32x3 RGB/Gray image |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 28x28x16 |
| Batch Normalization | |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 14x14x16 |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 10x10x48 |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 5x5x48 |
| flatten | 5x5x48 => 1200 |
| Fully connected | outputs 100 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Fully connected | outputs 100 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Softmax | outputs 43 (class number) |
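The output sizes in the table follow from the VALID padding rule, out = (in - kernel) / stride + 1. The sketch below reproduces them; it assumes the pooling windows are 2x2 with stride 2, which the table implies but does not state explicitly, and the function names are hypothetical.

```python
def valid_output_size(size, kernel, stride):
    """Spatial output size of a VALID (unpadded) convolution or pooling."""
    return (size - kernel) // stride + 1

def middle_model_shapes(input_size=32):
    """Walk through the middle model's layers and return each output shape."""
    s = valid_output_size(input_size, kernel=5, stride=1)
    conv1 = (s, s, 16)
    s = valid_output_size(s, kernel=2, stride=2)  # assumed 2x2 pooling window
    pool1 = (s, s, 16)
    s = valid_output_size(s, kernel=5, stride=1)
    conv2 = (s, s, 48)
    s = valid_output_size(s, kernel=2, stride=2)  # assumed 2x2 pooling window
    pool2 = (s, s, 48)
    flat = s * s * 48
    return conv1, pool1, conv2, pool2, flat

conv1, pool1, conv2, pool2, flat = middle_model_shapes()
```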
| Title | Description |
|---|---|
| Optimizer | Adam |
| learning_rate | 0.0002 |
| batch size | 100 |
| EPOCH Number | 200 |
The following figure shows the distribution of the mean and standard deviation of pixel values for each training image.
To make training work better, the following normalization types are possible.
The following figures show the distributions of the pixel mean and standard deviation for each type.
After normalization, the average images of each class look as follows.
Compared with the unnormalized average images described above, type 1 and type 2 normalization reduce the darkness issue. However, the low-chroma and background-texture issues still remain in the normalized images.
These issues can be addressed by augmenting the training data.
To check the potential of the middle model, I examined 7 types of input data as follows.
| No | Title | image type | Normalization type |
|---|---|---|---|
| 0 | RGB | RGB-3ch | Not normalized |
| 1 | RGB-Type0 | RGB-3ch | normalized over all pixels in the training data |
| 2 | RGB-Type1 | RGB-3ch | normalized over the pixels of each image |
| 3 | RGB-Type2 | RGB-3ch | normalized over the pixels of each RGB plane of each image |
| 4 | Gray | Gray | Not normalized |
| 5 | Gray-Type0 | Gray | normalized over all pixels in the training data |
| 6 | Gray-Type1 | Gray | normalized over the pixels of each image |
Normalization is performed with the following equation.
normalized_image = (org_image - mean) / (2.0 * stdev)
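The three normalization types differ only in which pixels the mean and standard deviation are computed over. The following NumPy sketch illustrates this under the assumption that images are stored as an array of shape (N, H, W, C). The function name, the mode strings and the small `eps` guard against division by zero are my own additions, not part of the original notebook.

```python
import numpy as np

def normalize(images, mode, eps=1e-8):
    """Apply (image - mean) / (2 * stdev) with statistics chosen by mode.

    mode "type0": statistics over all pixels of the whole array
    mode "type1": statistics over the pixels of each image
    mode "type2": statistics over each channel plane of each image
    """
    x = np.asarray(images, dtype=np.float64)
    if mode == "type0":
        mean, std = x.mean(), x.std()
    elif mode == "type1":
        mean = x.mean(axis=(1, 2, 3), keepdims=True)
        std = x.std(axis=(1, 2, 3), keepdims=True)
    elif mode == "type2":
        mean = x.mean(axis=(1, 2), keepdims=True)
        std = x.std(axis=(1, 2), keepdims=True)
    else:
        raise ValueError("unknown mode: " + str(mode))
    # eps is an assumed safeguard for perfectly flat images.
    return (x - mean) / (2.0 * std + eps)
```

Note that for type 0 the statistics should come from the training data and then be reused for the validation and test sets; this sketch computes them from whatever array is passed in.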
After 200 epochs of training, every input type reached over 93% validation accuracy, as shown below.
All of the input types seem likely to gain more accuracy with more training.
I decided to use RGB-Type1 as the input format for the rest of this study, although the feasibility test shows that grayscale input gets better accuracy than RGB input.
All 7 input formats, including the RGB formats, already satisfy the project's 93% accuracy goal.
This leaves room to take on further challenges, such as solving the low-chroma and background-texture issues described above.
RGB input may also make it easier to see which modifications affect these training dataset issues.
In the "middle model" feasibility test, I found examples that the model could not handle well, as shown below.
Some classes appear to have problems in addition to the known troublesome classes.
Judging from the number of training samples, the failed classes do not seem to have enough training data.
This class's validation data has very low-chroma images, but the training dataset for the class doesn't have such images.
This class's validation data has low-resolution images, but the training dataset for the class doesn't have such images.
This class's validation data has very dark and low-contrast images, but the training dataset for the class doesn't have such images.
This class's validation data has small traffic sign images, but the training dataset for the class doesn't have such images.
This class's validation data has very dark and low-contrast images, but the training dataset for the class doesn't have such images.
This class's validation data has high-contrast background images, but the training dataset for the class doesn't have such images.
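One way to find such failing classes is to break the validation accuracy down by class. The following is a minimal sketch, assuming the true labels and the model's predicted labels are available as integer arrays; the function name is hypothetical.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes=43):
    """Fraction of correctly classified samples for each true class.

    Classes without any samples get NaN.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = np.bincount(y_true, minlength=n_classes)
    correct = np.bincount(y_true[y_true == y_pred], minlength=n_classes)
    with np.errstate(divide="ignore", invalid="ignore"):
        return correct / total

acc = per_class_accuracy([0, 0, 1, 1, 2], [0, 1, 1, 1, 0], n_classes=4)
```

Sorting the classes by this value and comparing against the per-class training sample counts would show whether low accuracy coincides with a shortage of training data.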
I tried enlarging the filter tap size of the first convolutional layer, because the quick looks above suggested that the "middle model" may not be expressive enough to capture the textures inside traffic signs.
The following figure shows the accuracy curve per epoch for 4 model architectures.
"5x5" or "7x7" is the convolution filter tap size, and "0bn" or "1bn" indicates whether batch normalization is used ("0bn" is a model without batch normalization).
Models without batch normalization came close to their peak accuracy at about epoch 500.
Models with batch normalization had lower accuracy, at least before epoch 1000, though they may reach higher accuracy beyond 1000 epochs.
Batch normalization models might benefit from a higher learning rate than models without batch normalization.
Here, to compare under equal conditions, all 4 models use a learning rate of 0.0002.
Based on the feasibility tests, I chose the final model below.
I call the final model architecture "large model".
I set the unit counts to adequate values while watching how the histograms changed in TensorBoard. (It's a fantastic tool!)
The convolution filter counts of 64 and 84 and the fully connected unit count of 240 are moderate values that give smooth weight histograms.
The final model has two dropout layers to prevent overfitting.
| Layer | Description |
|---|---|
| Input | 32x32x3 RGB image |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 28x28x64 |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 14x14x64 |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 10x10x84 |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 5x5x84 |
| flatten | 5x5x84 => 2100 |
| Fully connected | outputs 240 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Fully connected | outputs 240 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Softmax | outputs 43 (class number) |
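To get a feel for the size of the large model, the trainable parameters can be counted layer by layer. This sketch assumes every layer has a bias term and that the final softmax is fed by a dense layer from 240 units to 43 classes; neither detail is stated in the table, and the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

large_model_params = {
    "conv1": conv_params(5, 3, 64),
    "conv2": conv_params(5, 64, 84),
    "fc1": dense_params(5 * 5 * 84, 240),  # flattened 5x5x84 = 2100 inputs
    "fc2": dense_params(240, 240),
    "output": dense_params(240, 43),       # assumed dense layer before softmax
}
total_params = sum(large_model_params.values())
```

Under these assumptions the first fully connected layer holds most of the parameters, which is also where the first dropout layer is applied.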
The training hyperparameters are the same as for the "middle model", except for the number of epochs.
They are also chosen for slow training to prevent overfitting.
| Title | Description |
|---|---|
| Optimizer | Adam |
| learning_rate | 0.0002 |
| batch size | 100 |
| EPOCH Number | 1000 |
Describe how you preprocessed the image data. What techniques were chosen and why did you choose these techniques? Consider including images showing the output of each preprocessing technique. Pre-processing refers to techniques such as converting to grayscale, normalization, etc.
Currently recalculating; to be aligned with the Jupyter notebook later.
My final model results were:
- training set accuracy of 0.99983
- validation set accuracy of 0.98209
- test set accuracy of 0.96017
To be written from here.
Recognition results of the large model
Select five new images and run recognition on them
| input image | answer | inference | |
|---|---|---|---|
| O | 4 | 4 : Speed limit (70km/h) | |
| O | 13 | 13 : Yield | |
| X | 17 | 3 : Speed limit (60km/h) | |
| O | 33 | 33 : Turn right ahead | |
| O | 40 | 40 : Roundabout mandatory | |
Here are five German traffic signs that I found on the web:
The first image might be difficult to classify because ...
Here are the results of the prediction:
| Image | Prediction |
|---|---|
| Stop Sign | Stop sign |
| U-turn | U-turn |
| Yield | Yield |
| 100 km/h | Bumpy Road |
| Slippery Road | Slippery Road |
The model was able to correctly guess 4 of the 5 traffic signs, which gives an accuracy of 80%. This compares favorably to the accuracy on the test set of ...
The code for making predictions on my final model is located in the 11th cell of the IPython notebook.
For the first image, the model is relatively sure that this is a stop sign (probability of 0.6), and the image does contain a stop sign. The top five softmax probabilities were:
| Probability | Prediction |
|---|---|
| .60 | Stop sign |
| .20 | U-turn |
| .05 | Yield |
| .04 | Bumpy Road |
| .01 | Slippery Road |
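A table like the one above can be produced from the network's raw output scores (logits) for one image. The following NumPy sketch shows the idea; the function names are hypothetical, and it works on a plain array rather than on the TensorFlow graph used in the notebook.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_k(probs, k=5):
    """Indices and probabilities of the k most likely classes, best first."""
    idx = np.argsort(probs)[::-1][:k]
    return idx, probs[idx]

# Toy logits for a 4-class problem (illustration only).
probs = softmax(np.array([1.0, 3.0, 2.0, 0.0]))
best_idx, best_probs = top_k(probs, k=2)
```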
For the second image ...
Optional
The difference between the original data set and the augmented data set is the following ...
OPTIONAL: As described in the "Stand Out Suggestions" part of the rubric, if you generated additional data for training, describe why you decided to generate additional data, how you generated the data, and provide example images of the additional data. Then describe the characteristics of the augmented training set like number of images in the set, number of images for each class, etc.
I identified 4 issues with the training data, as follows.
I plan the augmentations below to resolve them.
| method | purpose | target labels (classes) |
|---|---|---|
| enhance color | low-chroma expansion | 6, 32, 41, 42 |
| add vivid images | low-chroma expansion | 6, 32, 41, 42 |
| random value charge | background texture elimination | 16, 19, 20, 24, 30 |
| random position shift | background texture elimination | 16, 19, 20, 24, 30 |
| enhance brightness | dark brightness | 3, 5, 6, 7, 10, 20 |
| add bright images | dark brightness | 3, 5, 6, 7, 10, 20 |
| add various images | training data shortage | 20, 21, 40 ... |
| add noise | | |
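Two of the planned methods, brightness enhancement and random position shift, could look roughly like the sketch below. This is only an illustration under my own assumptions (uint8 images of shape (H, W, C), a multiplicative gain, edge padding for the shift); the function names and parameters are hypothetical and the actual augmentation code may differ.

```python
import numpy as np

def enhance_brightness(image, gain):
    """Scale pixel values by gain and clip to the valid uint8 range."""
    scaled = image.astype(np.float32) * gain
    return np.clip(scaled, 0, 255).astype(np.uint8)

def random_shift(image, max_shift, rng):
    """Shift the image by up to max_shift pixels in each direction.

    The border is filled by repeating edge pixels (an assumed choice).
    """
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(image, pad, mode="edge")
    y0, x0 = max_shift + dy, max_shift + dx
    h, w = image.shape[:2]
    return padded[y0:y0 + h, x0:x0 + w]

rng = np.random.default_rng(0)
sample = np.full((32, 32, 3), 100, dtype=np.uint8)
brighter = enhance_brightness(sample, gain=2.0)
shifted = random_shift(sample, max_shift=2, rng=rng)
```

Applying such functions only to the target labels listed in the table would also help balance the number of samples per class.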
Step 4 (Optional): Visualize the Neural Network's State with Test Images. This section is not required to complete but acts as an additional exercise for understanding the output of a neural network's weights. While neural networks can be a great learning device, they are often referred to as a black box. We can understand what the weights of a neural network look like better by plotting their feature maps. After successfully training your neural network, you can see what its feature maps look like by plotting the output of the network's weight layers in response to a test stimulus image. From these plotted feature maps, it's possible to see what characteristics of an image the network finds interesting. For a sign, maybe the inner network feature maps react with high activation to the sign's boundary outline or to the contrast in the sign's painted symbol.
Provided for you below is the function code that allows you to get the visualization output of any TensorFlow weight layer you want. The inputs to the function should be a stimulus image, one used during training or a new one you provided, and then the TensorFlow variable name that represents the layer's state during the training process. For instance, if you wanted to see what the LeNet lab's feature maps looked like for its second convolutional layer, you could enter conv2 as the tf_activation variable.
For an example of what feature map outputs look like, check out NVIDIA's results in their paper End-to-End Deep Learning for Self-Driving Cars in the section Visualization of internal CNN State. NVIDIA was able to show that their network's inner weights had high activations to road boundary lines by comparing feature maps from an image with a clear path to one without. Try experimenting with a similar test to show that your trained network's weights are looking for interesting features, whether it's looking at differences in feature maps from images with or without a sign, or even what feature maps look like in a trained network vs a completely untrained one on the same sign image.

    import matplotlib.pyplot as plt

    def outputFeatureMap(image_input, tf_activation, activation_min=-1, activation_max=-1, plt_num=1):
        # Here make sure to preprocess your image_input in a way your network expects
        # with size, normalization, etc. if needed
        # image_input =
        # Note: x should be the same name as your network's tensorflow data placeholder variable
        # If you get an error tf_activation is not defined it may be having trouble accessing the variable from inside a function
        # sess is the active TensorFlow session and must also be defined outside this function
        activation = tf_activation.eval(session=sess, feed_dict={x: image_input})
        featuremaps = activation.shape[3]
        # Keep the original 6x8 grid, but add rows when a layer has more than 48 feature maps
        n_rows = max(6, -(-featuremaps // 8))
        plt.figure(plt_num, figsize=(15, 15))
        for featuremap in range(featuremaps):
            plt.subplot(n_rows, 8, featuremap + 1)  # sets the number of feature maps to show on each row and column
            plt.title('FeatureMap ' + str(featuremap))  # displays the feature map number
            # "and" is used here; the original "&" bound tighter than "!=" and broke the test
            if activation_min != -1 and activation_max != -1:
                plt.imshow(activation[0, :, :, featuremap], interpolation="nearest", vmin=activation_min, vmax=activation_max, cmap="gray")
            elif activation_max != -1:
                plt.imshow(activation[0, :, :, featuremap], interpolation="nearest", vmax=activation_max, cmap="gray")
            elif activation_min != -1:
                plt.imshow(activation[0, :, :, featuremap], interpolation="nearest", vmin=activation_min, cmap="gray")
            else:
                plt.imshow(activation[0, :, :, featuremap], interpolation="nearest", cmap="gray")

Further reading
Learning image augmentation techniques from TensorFlow code (in Japanese): http://qiita.com/Hironsan/items/e20d0c01c95cb2e08b94
per_image_whitening

    # Random distortions applied to the image.
    # Randomly crop a [height, width] section of the image.
    distorted_image = tf.random_crop(reshaped_image, [height, width, 3])
    # Randomly flip the image horizontally.
    distorted_image = tf.image.random_flip_left_right(distorted_image)
    # Brightness and contrast changes are not commutative, so the order of these operations matters.
    distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
    distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)
    # Subtract off the mean and divide by the standard deviation of the pixels.
    float_image = tf.image.per_image_whitening(distorted_image)
